Abstract

Current deterministic algorithms for the construction of polar codes can only be argued to be practical for channels with small 
input alphabet sizes. In this paper, we show that any construction algorithm for channels with moderate input alphabet size which 
follows the paradigm of “degrading after each polarization step” will inherently be impractical with respect to a certain “hard” 
underlying channel. This result also sheds light on why the construction of LDPC codes using density evolution is impractical 
for channels with moderate sized input alphabets. 

Index Terms 

Polar codes, LDPC, construction, density evolution, degrading cost.

I. Introduction 

Polar codes [1] are a novel family of error correcting codes which are capacity achieving and have efficient encoding and
decoding algorithms. Originally defined for channels with binary input, they were soon generalized to channels with arbitrary
input alphabets [2]. Although polar codes are applicable to many information theoretic settings, the channel coding setting is
the one we consider in this paper. More specifically, we consider the symmetric capacity setting discussed in [1] and [2]. In
this setting, a polar code is obtained by unfreezing channels with probability of error at most $2^{-n^{\beta}}$, where $n$ is the code length
and $\beta > 0$ is a suitably chosen constant. A synthesized channel is obtained by repeatedly applying polar channel transforms.
The plus and minus polar transforms were defined in [1]. Other transforms are possible [4], [5], [6]; see also [7].

Since the synthesized channels have an output alphabet size which grows exponentially in the code length n, calculating their 
probability of misdecoding is intractable if approached directly. To the author’s knowledge, the only tunable and deterministic 
methods of circumventing this difficulty involve approximating some of the intermediate channels by channels which have a 
manageable output alphabet size. Simply put: before the first polarization step and after each polarization step, approximate 
the relevant channel by another channel having a prescribed output alphabet size. Doing so ensures that the channel output 
alphabet sizes do not grow intractably. 

The above “approximate after each polarization step” idea has its origins in density evolution [8, Page 217], a method to
evaluate the performance of LDPC code ensembles. Density evolution was suggested as a method of constructing polar codes
in [9]. In order to bound the misdecoding probability of a synthesized channel (as opposed to only approximating it), one
can force the approximating channel to be either (stochastically) degraded or upgraded with respect to it. An efficient algorithm
for such a degrading/upgrading approximation was introduced for the binary-input case in [10] and analyzed in [11]. See also
[12] for an optimal degrading algorithm. Algorithms for degrading and upgrading non-binary channels were given in [13] and
[14], respectively. See also [15]. On a related note, the construction of polar codes was recently proven to be polynomial [16],
for an arbitrary but fixed input alphabet size.

For a fixed input distribution, a degrading approximation results in a channel with reduced mutual information between 
input and output. This drop in mutual information should ideally be kept small. The reason for this will be elaborated on in 
Section III. In brief, the reason is that such a drop necessarily translates into a drop in code rate, both in the polar coding
setting as well as in the LDPC setting. Thus, a non-negligible drop in mutual information due to approximation necessarily 
means a coding scheme which is not capacity achieving. 

In this paper, we define a specific channel. With respect to this channel, we derive lower bounds on the drop in mutual 
information as a function of the channel input alphabet size, q, and the number of output letters of the approximating channel, 
L. Simply put, the main result of this paper is that for moderate values of q, a modest drop in mutual information translates into 
the requirement that L be unreasonably large, in the general case. It seems to be common knowledge that constructing capacity 
achieving LDPC or polar codes for channels with such input alphabet sizes is generally hard; this is commonly referred to as 
the “curse of dimensionality”. This paper is an attempt to quantify this hardness, under assumptions that are in line with what 
is currently done. 

The paper was presented in part at the 2015 IEEE International Symposium on Information Theory, Hong Kong, June 14 to June 19, 2015. Research
supported in part by the Israel Science Foundation grant 1769/13.

The structure of this paper is as follows. Section II introduces the main result of the paper, after stating the needed notation.
Section III explains the implications of the result to the hardness of constructing polar codes and LDPC codes. Section IV
contains a specialization of Hölder’s defect formula to our setting. Section V defines and analyzes the previously discussed
channel.


II. Notation and problem statement 

We denote a channel by $W \colon \mathcal{X} \to \mathcal{Y}$. The probability of receiving $y \in \mathcal{Y}$ given that $x \in \mathcal{X}$ was transmitted over $W$ is
denoted $W(y|x)$. All our channels will be defined over a finite input alphabet $\mathcal{X}$, with size $q = |\mathcal{X}|$. Unless specifically stated
otherwise, all channels will have a finite output alphabet, denoted $\mathrm{out}(W) = \mathcal{Y}$. Thus, the channel output alphabet size is
denoted $|\mathrm{out}(W)|$.

We will eventually deal with a specific channel, which turns out to be symmetric (as defined in [3, page 94]). In addition, the
input distribution we will ultimately assign to this channel turns out to be uniform. However, we would like to be as general
as possible wherever appropriate. Thus, unless specifically stated otherwise, we will not assume that a generic channel $W$ is
symmetric. Each channel will typically have a corresponding input distribution, denoted $P_X = P_X^{(W)}$. Note that $P_X$ need not
necessarily be uniform and need not necessarily be the input distribution achieving the capacity of $W$. We denote the random
variables corresponding to the input and output of $W$ by $X = X^{(W)}$ and $Y = Y^{(W)}$, respectively. The distribution of $Y$ is
denoted $P_Y = P_Y^{(W)}$. That is, for $y \in \mathcal{Y}$,

$$P_Y(y) = \sum_{x \in \mathcal{X}} P_X(x) W(y|x) .$$

The mutual information between X and Y is denoted as 

$$I(W) = I(X;Y) ,$$

and is henceforth measured in nats. That is, all logarithms henceforth are natural. Note that $I(W)$ typically does not equal the
capacity of $W$.
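The definition above can be made concrete with a few lines of code. The following is a minimal sketch, not part of the original text: the function name `mutual_information` and the row-per-input matrix layout `w[x][y] = W(y|x)` are assumptions chosen for illustration.

```python
import math

def mutual_information(p_x, w):
    """I(W) in nats for input distribution p_x and channel w[x][y] = W(y|x)."""
    n_out = len(w[0])
    # Output distribution P_Y(y) = sum_x P_X(x) W(y|x).
    p_y = [sum(p_x[x] * w[x][y] for x in range(len(p_x))) for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_out):
            joint = px * w[x][y]
            if joint > 0.0:  # terms with zero probability contribute nothing
                total += joint * math.log(w[x][y] / p_y[y])
    return total

# Binary symmetric channel with crossover probability 0.1, uniform input.
bsc = [[0.9, 0.1], [0.1, 0.9]]
print(mutual_information([0.5, 0.5], bsc))
```

Since natural logarithms are used, a noiseless binary channel with uniform input yields ln 2 nats rather than 1 bit.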

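We say that a channel $Q \colon \mathcal{X} \to \mathcal{Z}$ is (stochastically) degraded with respect to $W \colon \mathcal{X} \to \mathcal{Y}$ if there exists a channel
$\Phi \colon \mathcal{Y} \to \mathcal{Z}$ such that the concatenation of $\Phi$ to $W$ yields $Q$. Namely, for all $x \in \mathcal{X}$ and $z \in \mathcal{Z}$,

$$Q(z|x) = \sum_{y \in \mathcal{Y}} W(y|x) \Phi(z|y) . \tag{1}$$

We denote $Q$ being degraded with respect to $W$ as $Q \prec W$.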

For input alphabet size $q = |\mathcal{X}|$ and specified output alphabet size $L$, define the degrading cost as

$$\mathrm{DC}(q,L) = \sup_{W, P_X} \; \min_{\substack{Q \,:\, Q \prec W, \\ |\mathrm{out}(Q)| \le L}} \bigl( I(W) - I(Q) \bigr) . \tag{2}$$

Namely, both $W$ and $Q$ range over channels with input alphabet $\mathcal{X}$ such that $|\mathcal{X}| = q$; both channels share the same input
distribution $P_X$, which we optimize over; the channel $Q$ is degraded with respect to $W$; both channels have finite output
alphabets and the size of the output alphabet of $Q$ is at most $L$; we calculate the drop in mutual information incurred by
degrading $W$ to $Q$, for the “hardest” channel $W$, the “hardest” corresponding input distribution $P_X$, and the corresponding
best approximation $Q$.

Note that the above explanation of (2) is a bit off, since the outer qualifier is “sup”, not “max”. Namely, we might need to
consider a sequence of channels $W$ and input distributions $P_X$. Note however that the inner qualifier is a “min”, and not an
“inf”. This is justified by the following claim, which is taken from [12, Lemma 1].

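Claim 1: Let $W \colon \mathcal{X} \to \mathcal{Y}$ and $P_X$ be given. Let $L \ge 1$ be a specified integer for which $|\mathcal{Y}| \ge L$. Then,

$$\inf_{\substack{Q \,:\, Q \prec W, \\ |\mathrm{out}(Q)| \le L}} \bigl( I(W) - I(Q) \bigr)$$

is attained by a channel $Q \colon \mathcal{X} \to \mathcal{Z}$ for which it holds that $|\mathrm{out}(Q)| = L$ and

$$Q(z|x) = \sum_{y \in \mathcal{Y}} W(y|x) \Phi(z|y) , \qquad \Phi(z|y) \in \{0,1\} , \qquad \sum_{z \in \mathcal{Z}} \Phi(z|y) = 1 .$$

Namely, $Q$ is gotten from $W$ by defining a partition $(A_i)_{i=1}^{L}$ of $\mathcal{Y}$ and mapping with probability 1 all symbols in $A_i$ to a
single symbol $z_i \in \mathcal{Z}$, where $\mathcal{Z} = \{z_i\}_{i=1}^{L}$.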
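For a tiny channel, the inner minimization in the degrading cost can be carried out by brute force, since Claim 1 says that deterministic merges suffice. The sketch below is illustrative only: the names `merge` and `min_drop` are hypothetical, and the exhaustive search over all labelings is exponential in the output alphabet size, so it is not a practical degrading algorithm.

```python
import math
from itertools import product

def mutual_information(p_x, w):
    """I(W) in nats; w[x][y] = W(y|x)."""
    p_y = [sum(p * row[y] for p, row in zip(p_x, w)) for y in range(len(w[0]))]
    return sum(p * row[y] * math.log(row[y] / p_y[y])
               for p, row in zip(p_x, w) for y in range(len(p_y))
               if p * row[y] > 0.0)

def merge(w, labels, L):
    """Map output y to letter labels[y] with probability 1 (a 0/1-valued Phi)."""
    q_chan = [[0.0] * L for _ in w]
    for x, row in enumerate(w):
        for y, prob in enumerate(row):
            q_chan[x][labels[y]] += prob
    return q_chan

def min_drop(p_x, w, L):
    """Smallest I(W) - I(Q) over all deterministic merges into at most L letters."""
    base = mutual_information(p_x, w)
    return min(base - mutual_information(p_x, merge(w, labels, L))
               for labels in product(range(L), repeat=len(w[0])))

w = [[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.3, 0.5]]
print(min_drop([0.5, 0.5], w, 2))
```

Taking the supremum of `min_drop` over channels and input distributions would give the degrading cost itself; the code only evaluates the inner part for one fixed pair.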

In [13], an upper bound on $\mathrm{DC}(q,L)$ is derived. Specifically,

$$\mathrm{DC}(q,L) \le 2q \cdot \left(\frac{1}{L}\right)^{\frac{1}{q}} .$$

The above has been recently sharpened [14, Lemma 8] to

$$\mathrm{DC}(q,L) \le 2 \cdot q^{1+\frac{1}{q-1}} \cdot \left(\frac{1}{L}\right)^{\frac{1}{q-1}} .$$

These bounds are constructive and stem from a specific quantizing algorithm. Specifically, the algorithm is given as input
the channel $W$, the corresponding input distribution $P_X$, and an upper bound on the output alphabet size, $L$. Note that for a
fixed input alphabet size $q$ and a target difference $\epsilon$ such that $\mathrm{DC}(q,L) \le \epsilon$, the above implies that we take $L$ proportional to
$(1/\epsilon)^{q-1}$. That is, for moderate values of $q$, the required output alphabet size grows very rapidly in $1/\epsilon$. Because of this, [13]
explicitly states that the algorithm can be considered practical only for small values of $q$.

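We now quote our main result: a lower bound on $\mathrm{DC}(q, L)$. Let $\sigma_{q-1}$ be the constant for which the volume of a sphere in
$\mathbb{R}^{q-1}$ of radius $r$ is $\sigma_{q-1} r^{q-1}$. Namely,

$$\sigma_{q-1} = \frac{\pi^{\frac{q-1}{2}}}{\Gamma\left(\frac{q-1}{2} + 1\right)} ,$$

where $\Gamma$ is the Gamma function.

Theorem 2: Let $q$ and $L$ be specified. Then,

$$\mathrm{DC}(q, L) \ge \frac{q-1}{2(q+1)} \cdot \left( \frac{1}{\sigma_{q-1} \cdot (q-1)!} \right)^{\frac{2}{q-1}} \cdot \left( \frac{1}{L} \right)^{\frac{2}{q-1}} . \tag{3}$$

The above bound is attained in the limit for a sequence of symmetric channels, each having a corresponding input distribution
which is uniform.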

The consequences of this theorem in the context of code construction will be elaborated on in the next section. However, 
one immediate consequence is a vindication of sorts for the algorithm presented in [13]. That is, for $q$ fixed, we deduce from
the theorem that the optimal degrading algorithm must take the output alphabet size $L$ at least proportional to $(1/\epsilon)^{(q-1)/2}$,
where $\epsilon$ is the designed drop in mutual information. That is, the adverse effect of $L$ growing rapidly with $1/\epsilon$ is an inherent
property of the problem, and is not the consequence of a poor implementation. For a numerical example, take $q = 16$ and
$\epsilon = 10^{-5}$. The theorem states that the optimal degrading algorithm must allow for a target output alphabet size $L \approx 10^{23}$.
This number is for all intents and purposes intractable.

We note that the term multiplying $(1/L)^{2/(q-1)}$ in (3) can be simplified by Stirling’s approximation. The result is that

$$\mathrm{DC}(q,L) \gtrsim \frac{e}{4\pi(q-1)} \cdot \left(\frac{1}{L}\right)^{\frac{2}{q-1}} ,$$

and the approximation becomes tight as $q$ increases.

Note that the RHS of the above is eventually decreasing in $q$, for $L$ fixed. However, it must be the case that $\mathrm{DC}(q, L)$
is increasing in $q$ (to see this, note that the input distribution can give a probability of 0 to some input symbols). Thus, we
conclude that our bound is not tight.
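The lower bound of Theorem 2 is easy to evaluate numerically. The sketch below assumes the bound has the form $\frac{q-1}{2(q+1)} (\sigma_{q-1} (q-1)!)^{-2/(q-1)} (1/L)^{2/(q-1)}$ and that the Stirling simplification of the leading constant is $e/(4\pi(q-1))$; all function names are hypothetical. Logarithms of Gamma functions are used to avoid overflow for large $q$.

```python
import math

def log_sigma(n):
    """Natural log of the volume of the unit ball in R^n."""
    return (n / 2) * math.log(math.pi) - math.lgamma(n / 2 + 1)

def log_bound_constant(q):
    """Log of the factor multiplying (1/L)^(2/(q-1)) in the lower bound."""
    n = q - 1
    # lgamma(n + 1) = ln((q-1)!)
    return math.log(n / (2 * (q + 1))) - (2 / n) * (log_sigma(n) + math.lgamma(n + 1))

def dc_lower_bound(q, L):
    """The lower bound on DC(q, L), under the assumed form stated above."""
    return math.exp(log_bound_constant(q) - (2 / (q - 1)) * math.log(L))

def log10_required_L(q, eps):
    """log10 of the output alphabet size L at which the lower bound equals eps."""
    return ((q - 1) / 2) * (log_bound_constant(q) - math.log(eps)) / math.log(10)

def stirling_constant(q):
    """Assumed Stirling simplification of the leading constant."""
    return math.e / (4 * math.pi * (q - 1))

print(log10_required_L(16, 1e-5))   # roughly 23, i.e. L on the order of 10^23
print(math.exp(log_bound_constant(16)) / stirling_constant(16))
```

For $q = 16$ the exact constant is somewhat below its Stirling simplification; the ratio approaches 1 only as $q$ grows.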


III. Implications for code construction 

We now explain the relevance of our result to the construction of both polar codes and LDPC codes. In both cases, a “hard” 
underlying channel is used, with a corresponding input distribution that is uniform. Let us explain: for $q$ and $L$ fixed, and for
a uniform input distribution, we say that a channel is hard if the drop in mutual information incurred by degrading it to a
channel with at most $L$ output letters is, say, at least half of the RHS of (3). Theorem 2 assures us that such hard channels
exist. Put another way, the crucial point we will make use of is that for a hard channel, the drop in mutual information is at
least proportional to $(1/L)^{2/(q-1)}$.

A. Polar codes 

As explained in the introduction, the current methods of constructing polar codes for symmetric channels involve approximating
the intermediate channels by channels with a manageable output alphabet size. Specifically, the underlying channel
(the channel over which the codeword is transmitted) is approximated by degradation before any polarization operation is
applied. Now, for $q$ fixed and $L$ a parameter, consider an underlying hard channel, as defined above. Denote the underlying
channel as $W$, and let the result of the initial degrading approximation be denoted by $Q$.

The key point to note is that the construction algorithm cannot distinguish between $W$ and $Q$. That is, consider two runs
of the construction algorithm, one in which the underlying channel is $W$ and another in which the underlying channel is $Q$.
In the first case, the initial degradation produces $Q$ from $W$. In the second case, the initial degradation simply returns $Q$,
since the output alphabet size is at most $L$, and thus no reduction of output alphabet is needed. Thus, the rate of the code
constructed cannot be greater than the symmetric capacity of $Q$, which is at most $I(W) - \epsilon$, where $\epsilon$ is the drop in mutual
information incurred by the initial degradation. We can of course make $\epsilon$ arbitrarily
small. However, this would necessitate an $L$ at least proportional to $(1/\epsilon)^{(q-1)/2}$. For rather modest values of $q$ and $\epsilon$, this is
intractable.

B. LDPC codes

The standard way of designing an LDPC code for a specified underlying channel is by applying the density evolution 
algorithm [8, Section 4.4]. To simplify to our needs, density evolution performs a series of channel transformations on the
underlying channel, which are a function of the degree distribution of the code ensemble considered. Exactly as in the polar 
coding setting, these transformations increase the output alphabet size to intractable sizes. Thus, in practice, the channels are 
approximated. If we assume that the approximation is degrading — and it typically is — the rest of the argument is now 
essentially a repetition of the argument above. In brief, consider an LDPC code designed for a hard channel W. After the 
first degrading operation, a channel Q is gotten. The algorithm must produce the same result for both W and Q being the 
underlying channel. Thus, an ensemble with rate above that of the symmetric capacity of Q will necessarily be reported as 
“bad” with respect to both $W$ and $Q$. Reducing the drop in mutual information between $W$ and $Q$ is intractably costly for moderate
parameter choices.


IV. Preliminary lemmas 

As a consequence of the data processing inequality, if $Q$ is degraded with respect to $W$, then $I(W) - I(Q) \ge 0$. In this
section, we derive a tighter lower bound on the difference. To that end, let us first define $\eta(p)$ as

$$\eta(p) = -p \cdot \ln p , \qquad 0 \le p \le 1 ,$$

where $\eta(0) = 0$. Next, for a probability vector $p = (p_x)_{x \in \mathcal{X}}$, define

$$h(p) = \sum_{x \in \mathcal{X}} -p_x \cdot \ln p_x = \sum_{x \in \mathcal{X}} \eta(p_x) .$$

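For $A = \{y_1, y_2, \ldots, y_t\} \subseteq \mathcal{Y}$, define the quantity $\Delta(A)$ as the decrease in mutual information resulting from merging all
symbols in $A$ into a single symbol in $Q$. Namely, define

$$\Delta(A) \triangleq \pi \left[ h\left( \sum_{j=1}^{t} \theta_j p^{(j)} \right) - \sum_{j=1}^{t} \theta_j h\left( p^{(j)} \right) \right] , \tag{4}$$

where

$$\pi = \sum_{y \in A} P_Y(y) , \qquad \theta_j = P_Y(y_j)/\pi , \tag{5}$$

and

$$p^{(j)} = \bigl( P(X = x \mid Y = y_j) \bigr)_{x \in \mathcal{X}} . \tag{6}$$

The following claim is easily derived.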

Claim 3: Let $W$, $Q$, $P_X$, $L$, and $(A_i)_{i=1}^{L}$ be as in Claim 1. Then,

$$I(W) - I(Q) = \sum_{i=1}^{L} \Delta(A_i) . \tag{7}$$
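Claim 3 can be checked numerically on a small example. The sketch below assumes that the per-block drop is $\pi$ times the gap between the entropy of the $\theta$-mixture of a posteriori vectors and the $\theta$-average of their entropies; the helper names (`output_stats`, `delta`, `merge`) and the test channel are made up for illustration.

```python
import math

def eta(p):
    return 0.0 if p <= 0.0 else -p * math.log(p)

def h(vec):
    return sum(eta(p) for p in vec)

def output_stats(p_x, w):
    """Return P_Y and the a posteriori vectors (P(X=x | Y=y))_x for every y."""
    p_y = [sum(p * row[y] for p, row in zip(p_x, w)) for y in range(len(w[0]))]
    post = [[p * row[y] / p_y[y] for p, row in zip(p_x, w)] for y in range(len(p_y))]
    return p_y, post

def mutual_information(p_x, w):
    # I(X;Y) = h(P_X) - sum_y P_Y(y) h(posterior given y)
    p_y, post = output_stats(p_x, w)
    return h(p_x) - sum(py * h(pv) for py, pv in zip(p_y, post))

def delta(p_y, post, block):
    """Drop in mutual information caused by merging the outputs in `block`."""
    pi = sum(p_y[y] for y in block)
    theta = [p_y[y] / pi for y in block]
    mixed = [sum(t * post[y][x] for t, y in zip(theta, block))
             for x in range(len(post[0]))]
    return pi * (h(mixed) - sum(t * h(post[y]) for t, y in zip(theta, block)))

def merge(w, partition):
    """Deterministic degrading: each block of outputs becomes one letter."""
    return [[sum(row[y] for y in block) for block in partition] for row in w]

p_x = [0.2, 0.3, 0.5]
w = [[0.1, 0.2, 0.3, 0.25, 0.15], [0.3, 0.1, 0.1, 0.2, 0.3], [0.2] * 5]
partition = [[0, 3], [1], [2, 4]]
p_y, post = output_stats(p_x, w)
lhs = mutual_information(p_x, w) - mutual_information(p_x, merge(w, partition))
rhs = sum(delta(p_y, post, block) for block in partition)
print(lhs, rhs)
```

The sketch requires every output letter to have positive probability, since a posteriori vectors are otherwise undefined.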


Although the drop in mutual information is easily described, we were not able to analyze and manipulate it directly. We
now aim for a bound which is more amenable to analysis. As mentioned, by the concavity of $h$ and Jensen’s inequality, we
deduce that $\Delta(A_i) \ge 0$. Namely, data processing reduces mutual information. We will shortly make use of the fact that $h$ is
strongly concave in order to derive a sharper lower bound. To that end, we now state Hölder’s defect formula [17] (see [18,
Page 94] for an accessible reference).

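As is customary, we will phrase Hölder’s defect formula for $\cup$-convex functions, although we will later apply it to $h$ which
is $\cap$-concave. We remind the reader that for twice differentiable $\cup$-convex functions, $f \colon D \to \mathbb{R}$, $D \subseteq \mathbb{R}^n$, the Hessian of $f$,
denoted

$$\nabla^2 f(\alpha) = \left( \frac{\partial^2 f(\alpha)}{\partial \alpha_i \, \partial \alpha_j} \right)_{i,j} ,$$

is positive semidefinite on the interior of $D$ [19, page 71]. We denote the smallest eigenvalue of $\nabla^2 f(\alpha)$ by $\lambda_{\min}(\nabla^2 f(\alpha))$.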

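Lemma 4: Let $f(\alpha) \colon D \to \mathbb{R}$ be a twice differentiable convex function defined over a convex domain $D \subseteq \mathbb{R}^n$. Let $m \ge 0$
be such that for all $\alpha$ in the interior of $D$,

$$m \le \lambda_{\min}(\nabla^2 f(\alpha)) .$$

Fix $(\alpha_j)_{j=1}^{t} \subseteq D$ and let $(\theta_j)_{j=1}^{t}$ be non-negative coefficients summing to 1. Denote

$$\bar{\alpha} = \sum_{j=1}^{t} \theta_j \alpha_j$$

and

$$\gamma = \frac{1}{2} \sum_{j=1}^{t} \theta_j \|\alpha_j - \bar{\alpha}\|_2^2 = \frac{1}{4} \sum_{j=1}^{t} \sum_{k=1}^{t} \theta_j \theta_k \|\alpha_j - \alpha_k\|_2^2 .$$

Then,

$$\sum_{j=1}^{t} \theta_j f(\alpha_j) - f(\bar{\alpha}) \ge m \gamma .$$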


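Proof: Let $\Lambda$ be a diagonal matrix having all diagonal entries equal to $m$. By definition of $m$, we have that the function $g(\alpha) =
f(\alpha) - \frac{1}{2} \alpha^T \Lambda \alpha$ has a Hessian which is positive semidefinite for all $\alpha$ in the interior of $D$, and is thus convex. Thus, by Jensen’s inequality,

$$\sum_{j} \theta_j g(\alpha_j) - g\Bigl( \sum_{j} \theta_j \alpha_j \Bigr) \ge 0 .$$

Replacing $g(\alpha)$ in the above expression by $f(\alpha) - \frac{1}{2} \alpha^T \Lambda \alpha$ and rearranging yields the required result. ■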
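The statement of Lemma 4 can be sanity-checked numerically. The sketch below is an illustration under assumptions: it takes the right-hand side to be $m$ times half the $\theta$-weighted mean squared distance to the mean, and uses two hand-picked test functions, the squared norm (Hessian $2I$, so $m = 2$ and equality holds) and the sum of exponentials on $[0,1]^n$ (Hessian at least $I$, so $m = 1$). The name `defect_terms` is hypothetical.

```python
import math

def sq_norm(v):
    return sum(x * x for x in v)

def defect_terms(theta, alphas, f):
    """Return (Jensen defect, mean form of the bound, pairwise form of the bound)."""
    n = len(alphas[0])
    mean = [sum(t * a[i] for t, a in zip(theta, alphas)) for i in range(n)]
    mean_form = 0.5 * sum(t * sq_norm([a[i] - mean[i] for i in range(n)])
                          for t, a in zip(theta, alphas))
    pair_form = 0.25 * sum(tj * tk * sq_norm([aj[i] - ak[i] for i in range(n)])
                           for tj, aj in zip(theta, alphas)
                           for tk, ak in zip(theta, alphas))
    defect = sum(t * f(a) for t, a in zip(theta, alphas)) - f(mean)
    return defect, mean_form, pair_form

theta = [0.2, 0.5, 0.3]
alphas = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.5]]

# f(a) = ||a||^2: the smallest Hessian eigenvalue is exactly 2.
d_sq, g_mean, g_pair = defect_terms(theta, alphas, sq_norm)
# f(a) = sum_i exp(a_i): on [0,1]^n the smallest Hessian eigenvalue is at least 1.
d_exp, _, _ = defect_terms(theta, alphas, lambda a: sum(math.exp(x) for x in a))
print(d_sq, 2 * g_mean, d_exp, g_mean)
```

The two expressions for the bound (distance to the mean versus pairwise distances) agree, which is the identity used in the statement of the lemma.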

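We now apply Hölder’s defect formula (Lemma 4) in order to bound $\Delta(A)$. For $A = \{y_1, y_2, \ldots, y_t\} \subseteq \mathcal{Y}$, define

$$\tilde{\Delta}(A) \triangleq \frac{\pi}{2} \sum_{j=1}^{t} \theta_j \|p^{(j)} - \bar{p}\|_2^2 = \frac{\pi}{4} \sum_{j=1}^{t} \sum_{k=1}^{t} \theta_j \theta_k \|p^{(j)} - p^{(k)}\|_2^2 , \tag{8}$$

where $\pi$ and $\theta_j$ are as in (5), $p^{(j)}$ is as defined in (6), and

$$\bar{p} = \sum_{j=1}^{t} \theta_j p^{(j)} .$$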

The following is a simple corollary of Lemma 4.

Corollary 5: Let $W$, $Q$, $P_X$, $L$, and $(A_i)_{i=1}^{L}$ be as in Claim 1. Then, for all $1 \le i \le L$,

$$\Delta(A_i) \ge \tilde{\Delta}(A_i) . \tag{9}$$

Thus,

$$I(W) - I(Q) \ge \sum_{i=1}^{L} \tilde{\Delta}(A_i) . \tag{10}$$

Proof: The second inequality follows from the first inequality and (7). We now prove the first inequality. Let $D = [0,1]^n$,
the set of vectors of length $n$ having each entry between 0 and 1. Since the second derivative of $\eta$ is $\eta''(p) = -1/p$, we
conclude $\lambda_{\min}(-\nabla^2 h(p)) \ge 1$ for all $p$ in the interior $(0,1)^n$. That is, we take $m = 1$ in Lemma 4. Since $h$ is continuous on $D$,
our result follows by Lemma 4 and standard limiting arguments. ■
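The inequality of Corollary 5 can be probed on random instances. The following sketch assumes the exact drop is $\pi [h(\bar{p}) - \sum_j \theta_j h(p^{(j)})]$ and the surrogate is $\frac{\pi}{2} \sum_j \theta_j \|p^{(j)} - \bar{p}\|_2^2$; the helper names and the way random probability vectors are drawn are illustrative choices, not taken from the paper.

```python
import math
import random

def h(vec):
    return sum(-p * math.log(p) for p in vec if p > 0.0)

def delta_and_tilde(pi, theta, post):
    """Exact drop and its quadratic lower bound for one merged block."""
    q = len(post[0])
    mean = [sum(t * p[x] for t, p in zip(theta, post)) for x in range(q)]
    exact = pi * (h(mean) - sum(t * h(p) for t, p in zip(theta, post)))
    surrogate = 0.5 * pi * sum(t * sum((p[x] - mean[x]) ** 2 for x in range(q))
                               for t, p in zip(theta, post))
    return exact, surrogate

def random_prob_vector(n, rng):
    """Normalized exponentials: a uniformly distributed probability vector."""
    g = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(g)
    return [v / s for v in g]

rng = random.Random(0)
theta = random_prob_vector(4, rng)
post = [random_prob_vector(3, rng) for _ in range(4)]
print(delta_and_tilde(0.25, theta, post))
```

The gap between the two quantities is what the rest of the argument gives up in exchange for a quantity that depends only on squared Euclidean distances.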

V. Bounding the degrading cost 

We now turn to bounding the degrading cost. As a first step, we define a channel $W$ for which we will prove a lower bound
on the cost of degrading.


A. The channel W 

For a specified integer $M \ge 1$, we now define the channel $W = W_M$, where $W \colon \mathcal{X} \to \mathcal{Y}$. The input alphabet is $\mathcal{X} =
\{1, 2, \ldots, q\}$, of size $|\mathcal{X}| = q$. The output alphabet consists of vectors of length $q$ with integer entries, defined as follows:

$$\mathcal{Y} = \Bigl\{ \langle j_1, j_2, \ldots, j_q \rangle \;:\; j_1, j_2, \ldots, j_q \ge 0 , \; \sum_{x=1}^{q} j_x = M \Bigr\} . \tag{11}$$

The channel transition probabilities are given by

$$W(\langle j_1, j_2, \ldots, j_q \rangle \mid x) = \frac{q \cdot j_x}{M \binom{M+q-1}{q-1}} .$$

Lemma 6: The above defined $W$ is a valid channel with output alphabet size

$$|\mathrm{out}(W)| = \binom{M+q-1}{q-1} . \tag{12}$$

Proof: The binomial expression for the output alphabet size follows by noting that we are essentially dealing with an
instance of “combinations with repetitions” [20, Page 15]. Obviously, the probabilities are non-negative. It remains to show
that for all $x \in \mathcal{X}$,

$$\sum_{\langle j_1, j_2, \ldots, j_q \rangle \in \mathcal{Y}} \frac{q \cdot j_x}{M \binom{M+q-1}{q-1}} = 1 .$$

Since the above is independent of $x$, we can equivalently show that

$$\sum_{\langle j_1, j_2, \ldots, j_q \rangle \in \mathcal{Y}} \frac{q \cdot (j_1 + j_2 + \cdots + j_q)}{M \binom{M+q-1}{q-1}} = q .$$

By the definition of $\mathcal{Y}$ in (11), the numerator above equals $q \cdot M$. Since we have already proved (12), the result follows. ■
Recall the definition of symmetry in [3, page 94]: Let $W \colon \mathcal{X} \to \mathcal{Y}$ be a channel. Define the probability matrix associated
with $W$ as a matrix with rows indexed by $\mathcal{X}$ and columns by $\mathcal{Y}$ such that entry $(x,y) \in \mathcal{X} \times \mathcal{Y}$ equals $W(y|x)$. The channel
$W$ is symmetric if the output alphabet can be partitioned into sets, and the following holds: for each set, the corresponding
submatrix is such that every row is a permutation of the first row and every column is a permutation of the first column.

Lemma 7: The above defined $W$ is a symmetric channel.

Proof: Define the partition so that two output letters, $\langle j_1, j_2, \ldots, j_q \rangle$ and $\langle j'_1, j'_2, \ldots, j'_q \rangle$, are in the same set if there exists
a permutation $\pi \colon \mathcal{X} \to \mathcal{X}$ such that $j_x = j'_{\pi(x)}$ for all $x \in \mathcal{X}$. ■

Since $W$ is symmetric, it follows from [3, Theorem 4.5.2] that the capacity achieving distribution is the uniform distribution.
Thus, we take the corresponding input distribution as uniform. Namely, for all $x \in \mathcal{X}$,

$$P(X = x) = \frac{1}{q} .$$

As a result, all output letters are equally likely (the proof is similar to that of Lemma 6).
Denote the vector of a posteriori probabilities corresponding to $\langle j_1, j_2, \ldots, j_q \rangle$ as

$$p(j_1, j_2, \ldots, j_q) = \bigl( P(X = x \mid Y = \langle j_1, j_2, \ldots, j_q \rangle) \bigr)_{x=1}^{q} .$$

A short calculation gives

$$p(j_1, j_2, \ldots, j_q) = \left( \frac{j_1}{M}, \frac{j_2}{M}, \ldots, \frac{j_q}{M} \right) .$$


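In light of the above, let us define the shorthand

$$\langle j_1, j_2, \ldots, j_q \rangle \triangleq \left( \frac{j_1}{M}, \frac{j_2}{M}, \ldots, \frac{j_q}{M} \right) . \tag{13}$$

With this shorthand in place, the label of each output letter $\langle j_1, j_2, \ldots, j_q \rangle \in \mathcal{Y}$ is the corresponding a posteriori probability
vector $p(j_1, j_2, \ldots, j_q)$. Thus, we gain a simple expression for $\tilde{\Delta}(A)$. Namely, for $A \subseteq \mathcal{Y}$,

$$\tilde{\Delta}(A) = \frac{1}{2 \binom{M+q-1}{q-1}} \sum_{p \in A} \|p - \bar{p}\|_2^2 , \qquad \bar{p} = \sum_{p \in A} \frac{1}{|A|} \, p .$$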
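The properties of the channel claimed so far can be verified exactly for small parameters. The sketch below assumes the transition probabilities are $q \cdot j_x / (M \binom{M+q-1}{q-1})$, which is the normalization forced by requiring each row to sum to 1 when the probability is proportional to $j_x$. The names `compositions`, `build_channel`, and `posterior` are hypothetical; exact rational arithmetic is used so that equalities can be tested without tolerances.

```python
from fractions import Fraction
from math import comb

def compositions(total, parts):
    """All tuples of `parts` non-negative integers summing to `total`."""
    if parts == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def build_channel(q, M):
    """Return the output letters and the matrix w[x][y] of transition probabilities."""
    outputs = list(compositions(M, q))
    denom = M * comb(M + q - 1, q - 1)
    w = [[Fraction(q * j[x], denom) for j in outputs] for x in range(q)]
    return outputs, w

def posterior(w, y):
    """P_Y(y) and the a posteriori vector of output y under a uniform input."""
    q = len(w)
    p_y = sum(Fraction(1, q) * w[x][y] for x in range(q))
    return p_y, [Fraction(1, q) * w[x][y] / p_y for x in range(q)]

outputs, w = build_channel(3, 4)
print(len(outputs), posterior(w, 0))
```

With $q = 3$ and $M = 4$ there are $\binom{6}{2} = 15$ output letters, each occurring with probability $1/15$, and the a posteriori vector of the letter $\langle j_1, j_2, j_3 \rangle$ is $(j_1/4, j_2/4, j_3/4)$.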


We remark in passing that as $M \to \infty$, $W$ “converges” to the channel $W_\infty \colon \mathcal{X} \to \mathcal{X} \times [0,1]^q$ which we now define. Given
an input $x$, the channel picks $\varphi_1, \varphi_2, \ldots, \varphi_q$ as follows: $\varphi_1, \varphi_2, \ldots, \varphi_{q-1}$ are picked according to the Dirichlet distribution
$D(1, 1, \ldots, 1)$, while $\varphi_q$ is set to $1 - \sum_{x=1}^{q-1} \varphi_x$. That is, $(\varphi_1, \varphi_2, \ldots, \varphi_q)$ is chosen uniformly from all possible probability
vectors of length $q$. Then, the input $x$ is transformed into $x + i$ (with a modulo operation where appropriate$^1$) with probability
$\varphi_i$. The transformed symbol along with the vector $(\varphi_1, \varphi_2, \ldots, \varphi_q)$ is the output of the channel.


B. Optimizing A' 

Our aim is to find a lower bound on $\tilde{\Delta}(A)$, where $A \subseteq \mathcal{Y}$ is constrained to have a size $|A| = t$. Recalling (13), note that
all output letters $p = (p_x)_{x=1}^{q} \in \mathcal{Y}$ must satisfy the following three properties.

1) All entries $p_x$ are of the form $j_x/M$, where $j_x$ is an integer.

2) All entries $p_x$ sum to 1.

3) All entries $p_x$ are non-negative.

Since all entries must sum to 1 by property 2, entry $p_q$ is redundant. Thus, for a given $p \in \mathcal{Y}$, denote by $p'$ the first $q - 1$
coordinates of $p$. Let $A'$ be the set one gets by applying this puncturing operation to each element of $A$. Denote

$$\tilde{\Delta}(A') \triangleq \frac{1}{2 \binom{M+q-1}{q-1}} \sum_{p' \in A'} \|p' - \bar{p}'\|_2^2 , \tag{14}$$

where $\bar{p}'$ is the mean of the elements of $A'$.


$^1$To be precise, $x$ is transformed into $1 + ((x - 1 + i) \bmod q)$.

One easily shows that

$$\tilde{\Delta}(A') \le \tilde{\Delta}(A) , \tag{15}$$

thus a lower bound on $\tilde{\Delta}(A')$ is also a lower bound on $\tilde{\Delta}(A)$.

In order to find a lower bound on $\tilde{\Delta}(A')$ we relax constraint 3 above. Namely, a set $A'$ with elements $p'$ will henceforth
mean a set for which each element $p' = (p_x)_{x=1}^{q-1}$ has entries of the form $p_x = j_x/M$, and each such entry is not required to be
non-negative. Our revised aim is to find a lower bound on $\tilde{\Delta}(A')$ where $A'$ holds elements as just defined and is constrained to
have size $t$. The simplification enables us to give a characterization of the optimal $A'$. Informally, the optimal $A'$ is a sphere, up to irregularities
on the boundary.

Lemma 8: Let $t > 0$ be a given integer. Let $A'$ be the set of size $|A'| = t$ for which $\tilde{\Delta}(A')$ is minimized. Denote by $\bar{p}'$ the
mean of all elements of $A'$. Then, $A'$ has a critical radius $r$: all $p'$ for which $\|p' - \bar{p}'\|_2^2 < r^2$ are in $A'$ and all $p'$ for which
$\|p' - \bar{p}'\|_2^2 > r^2$ are not in $A'$.

Proof: We start by considering a general $A'$. Suppose $p'(1) \in A'$ is such that $r^2 = \|p'(1) - \bar{p}'\|_2^2$. Next, suppose that
there is a $p'(2) \notin A'$ such that $\|p'(2) - \bar{p}'\|_2^2 < r^2$. Then, for

$$B' = A' \cup \{p'(2)\} \setminus \{p'(1)\} , \qquad \tilde{\Delta}(B') < \tilde{\Delta}(A') .$$

To see this, first note that

$$\sum_{p' \in B'} \|p' - \bar{p}'\|_2^2 < \sum_{p' \in A'} \|p' - \bar{p}'\|_2^2 . \tag{16}$$

Next, note that the RHS of (16) is $\tilde{\Delta}(A')$, up to the normalizing factor in (14), but the LHS is not $\tilde{\Delta}(B')$. Namely, $\bar{p}'$ is the mean of the vectors in $A'$ but is not
the mean of the vectors in $B'$. However, $\sum_{p' \in B'} \|p' - u'\|_2^2$ is minimized for $u'$ equal to the mean of the vectors in $B'$ (to
see this, differentiate the sum with respect to every coordinate of $u'$). Thus, up to the same normalizing factor, the LHS of (16) is at least $\tilde{\Delta}(B')$ while the RHS
equals $\tilde{\Delta}(A')$.

The operation of transforming $A'$ into $B'$ as above can be applied repeatedly, and must terminate after a finite number of
steps. To see this, note that the sum $\sum_{p' \in A'} \|p' - \bar{p}'\|_2^2$ is constantly decreasing, and so is upper bounded by the initial sum.
Therefore, one can bound the maximum distance between any two points in $A'$. Since the sum is invariant to translations, we
can always translate $A'$ such that its members are contained in a suitably large hypercube (the translation will preserve the
$1/M$ grid property). The number of ways to distribute $|A'|$ grid points inside the hypercube is finite. Since the sum is strictly
decreasing and non-negative, the number of steps is finite. The ultimate termination implies a critical $r$ as well as the existence
of an optimal $A'$. ■
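The swapping argument in the proof can be run directly on a small planar example. The sketch below is an illustration under assumptions: points live on the integer grid (the $1/M$ scaling is dropped since it does not affect the argument), candidates are restricted to a finite box, and the names `scatter` and `swap_until_stable` are hypothetical. Exact rational arithmetic is used so that the strict decrease of the sum, and hence termination, is not affected by rounding.

```python
from fractions import Fraction
from itertools import product

def mean(points):
    pts = list(points)
    return tuple(Fraction(sum(p[k] for p in pts), len(pts)) for k in range(len(pts[0])))

def sq_dist(p, c):
    return sum((a - b) ** 2 for a, b in zip(p, c))

def scatter(points):
    """Sum of squared distances from the points to their own mean."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points)

def swap_until_stable(points, box):
    """Repeat the swap from the proof: replace the farthest member by a closer non-member."""
    current = set(points)
    grid = list(product(range(-box, box + 1), repeat=len(points[0])))
    while True:
        c = mean(sorted(current))
        far = max(sorted(current), key=lambda p: sq_dist(p, c))
        near = min((p for p in grid if p not in current), key=lambda p: sq_dist(p, c))
        if sq_dist(near, c) >= sq_dist(far, c):
            return current  # every non-member is at least as far as every member
        current.remove(far)
        current.add(near)

start = [(0, 0), (5, 5), (-5, 3), (2, -6), (1, 1), (7, -7)]
stable = swap_until_stable(start, box=8)
print(sorted(stable), scatter(stable), scatter(start))
```

The returned set is only guaranteed to be stable under single swaps within the box, which is the critical-radius property; the sketch does not claim that it is the global minimizer.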

Recall that a sphere of radius $r$ in $\mathbb{R}^{q-1}$ has volume $\sigma_{q-1} r^{q-1}$, where $\sigma_{q-1}$ is a well known constant [21, Page 411]. Given
a set $A'$, we define the volume of $A'$ as

$$\mathrm{Vol}(A') \triangleq \frac{|A'|}{M^{q-1}} .$$

For optimal $A'$ as above, the following lemma approximates $\mathrm{Vol}(A')$ by the volume of a corresponding sphere.

Lemma 9: Let $A'$ be a set of size $t$ for which $\tilde{\Delta}(A')$ is minimized. Let the critical radius be $r$ and assume that $r \le 4$. Then,

$$\mathrm{Vol}(A') = \sigma_{q-1} r^{q-1} + \epsilon_{q-1}(t) .$$

The error term $\epsilon_{q-1}(t)$ is bounded from both above and below by functions of $M$ alone (not of $t$) that are $o(1)$ (decay to 0
as $M \to \infty$).

Proof: Let $\delta \colon \mathbb{R}^{q-1} \to \{0,1\}$ be the indicator function of a sphere with radius $r$ centered at $\bar{p}'$. That is,

$$\delta(p') = \begin{cases} 1 & \|p' - \bar{p}'\|_2^2 \le r^2 , \\ 0 & \text{otherwise} . \end{cases}$$

Note that 1) $\delta$ is a bounded function and 2) the measure of points for which $\delta$ is not continuous is zero (the boundary of a
sphere has no volume). Thus, $\delta$ is Riemann integrable [22, Theorem 14.5].

Consider the set $\Psi'$ which is $[-4, 4]^{q-1}$ shifted by $\bar{p}'$. Since $\Psi'$ contains the above sphere, the integral of $\delta$ over $\Psi'$ must
equal $\sigma_{q-1} r^{q-1}$. We now show a specific Riemann sum [22, Definition 14.2] which must converge to this integral. Consider
a partition of $\Psi'$ into cubes of side length $1/M$, where each cube center is of the form $(j_1/M, j_2/M, \ldots, j_{q-1}/M)$ and the
$j_x$ are integers (the fact that cubes at the edge of $\Psi'$ are of volume less than $1/M^{q-1}$ is immaterial). Define $[p' \in A']$ as 1 if
the condition $p' \in A'$ holds and 0 otherwise. We claim that the following is a Riemann sum of $\delta$ over $\Psi'$ with respect to the
above partition.

$$\sum_{p' = (j_1/M, \ldots, j_{q-1}/M) \in \Psi'} \frac{1}{M^{q-1}} \, [p' \in A'] .$$

To see this, recall that $A'$ has critical radius $r$.

The absolute value of the difference between the above sum and $\sigma_{q-1} r^{q-1}$ can be upper bounded by the number of cubes
that straddle the sphere times their volume $1/M^{q-1}$ (any finer partition will only affect these cubes). Since $r \le 4$, this quantity
must go to zero as $M$ grows, no matter how we let $r$ depend on $M$. ■
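The Riemann-sum argument can be illustrated in the plane, that is, for $q - 1 = 2$, where the sphere volume is $\pi r^2$. The sketch below is a simplification under assumptions: the set is taken to be exactly the grid points within distance $r$ of the origin (an idealized set with critical radius $r$, centered on a grid point), and the name `grid_volume` is hypothetical.

```python
import math

def grid_volume(r, M):
    """|A'| / M^2 for A' = all points (i/M, j/M) within distance r of the origin."""
    reach = int(r * M) + 1
    count = sum(1 for i in range(-reach, reach + 1)
                for j in range(-reach, reach + 1)
                if i * i + j * j <= (r * M) ** 2)
    return count / M ** 2

r = 0.5
for M in (10, 50, 200):
    print(M, grid_volume(r, M), math.pi * r ** 2)
```

The discrepancy is governed by the cubes that straddle the circle, whose total area shrinks as $M$ grows.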

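Lemma 10: Let $A'$ be a set of size $t$ for which $\tilde{\Delta}(A')$ is minimized. Let the critical radius be $r$ and assume that $r \le 4$.
Then,

$$\tilde{\Delta}(A') = \frac{(q-1) \cdot (q-1)! \cdot \sigma_{q-1} \, r^{q+1}}{2(q+1)} + \epsilon_{q-1}(t) .$$

The error term $\epsilon_{q-1}(t)$ is bounded from both above and below by functions of $M$ alone (not of $t$) that are $o(1)$ (decay to 0
as $M \to \infty$).

Proof: Let the sphere indicator function $\delta$ and the bounding set $\Psi'$ be as in the proof of Lemma 9. Consider the sum

$$\sum_{p' = (j_1/M, \ldots, j_{q-1}/M) \in \Psi'} \frac{1}{M^{q-1}} \, \|p' - \bar{p}'\|_2^2 \, [p' \in A'] . \tag{17}$$

On the one hand, by (14), this sum is simply

$$\frac{2 \binom{M+q-1}{q-1}}{M^{q-1}} \cdot \tilde{\Delta}(A') . \tag{18}$$

On the other hand, (17) is the Riemann sum corresponding to the integral

$$\int_{\Psi'} \|p' - \bar{p}'\|_2^2 \, \delta(p') \, dp' ,$$

with respect to the same partition as was used in the proof of Lemma 9. As before, the sum must converge to the integral,
and the convergence rate can be shown to be bounded by expressions which are not a function of $t$.

All that remains is to calculate the integral. Denote by $\mathrm{sphere}_{q-1}(r) \subseteq \mathbb{R}^{q-1}$ the sphere centered at the origin with radius
$r$. After translating $\bar{p}'$ to the origin, the integral becomes

$$\int_{\mathrm{sphere}_{q-1}(r)} (x_1^2 + x_2^2 + \cdots + x_{q-1}^2) \, dx_1 dx_2 \cdots dx_{q-1} = \frac{(q-1) \, \sigma_{q-1} \, r^{q+1}}{q+1} , \tag{19}$$

where the RHS is derived as follows. After converting the integral to generalized spherical coordinates, with radial coordinate $\rho$,

$$x_1 = \rho \cos(\theta_1) ,$$
$$x_2 = \rho \sin(\theta_1) \cos(\theta_2) ,$$
$$\vdots$$
$$x_{q-2} = \rho \sin(\theta_1) \sin(\theta_2) \cdots \sin(\theta_{q-3}) \cos(\theta_{q-2}) ,$$
$$x_{q-1} = \rho \sin(\theta_1) \sin(\theta_2) \cdots \sin(\theta_{q-3}) \sin(\theta_{q-2}) ,$$

we get an integrand that is $\rho^2$ times the integrand we would have gotten had the original integrand been 1 (this follows by
applying the identity $\sin^2 \theta + \cos^2 \theta = 1$ repeatedly). We know that had that been the case, the integral would have equaled
$\sigma_{q-1} r^{q-1}$, which means that the radial part of that integral is $\int_0^r (q-1) \sigma_{q-1} \rho^{q-2} \, d\rho$. Multiplying the integrand by $\rho^2$ thus gives
$\int_0^r (q-1) \sigma_{q-1} \rho^{q} \, d\rho$, which equals the RHS of (19).

Since (19) must equal the limit of (18), and since the fraction in (18) converges to $2/(q-1)!$, the claim follows. ■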
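The value of the integral can be cross-checked against a grid sum in the plane ($q = 3$), where the closed form $(q-1) \sigma_{q-1} r^{q+1} / (q+1)$ reduces to $\pi r^4 / 2$. The sketch below assumes that closed form and, as a simplification, centers the sphere on a grid point; the function names are hypothetical.

```python
import math

def grid_second_moment(r, M):
    """Sum over grid points p = (i/M, j/M) with |p| <= r of |p|^2 / M^2."""
    reach = int(r * M) + 1
    total = sum(i * i + j * j
                for i in range(-reach, reach + 1)
                for j in range(-reach, reach + 1)
                if i * i + j * j <= (r * M) ** 2)
    return total / M ** 4  # |p|^2 = (i^2 + j^2) / M^2, times cube volume 1 / M^2

def ball_second_moment(q, r):
    """Assumed closed form of the integral of |x|^2 over a radius-r ball in R^(q-1)."""
    n = q - 1
    sigma = math.pi ** (n / 2) / math.gamma(n / 2 + 1)
    return n * sigma * r ** (q + 1) / (q + 1)

print(grid_second_moment(0.5, 200), ball_second_moment(3, 0.5))
```

For one dimension ($q = 2$) the same closed form gives $2 r^3 / 3$, which matches the elementary integral of $x^2$ over $[-r, r]$.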

As a corollary to the above three lemmas, we have the following result. The important point to note is that the RHS is
convex in $\mathrm{Vol}(A')$.

Corollary 11: Let $t > 0$ be a given integer. Let $A'$ be a set of size $t$ and assume that

$$\max_{p' \in A'} \|p' - \bar{p}'\|_2^2 \le 2 . \tag{20}$$

Then,

$$\tilde{\Delta}(A') \ge \frac{(q-1) \cdot (q-1)!}{2(q+1) \cdot (\sigma_{q-1})^{\frac{2}{q-1}}} \cdot \mathrm{Vol}(A')^{\frac{q+1}{q-1}} + o(1) , \tag{21}$$

where the $o(1)$ is a function of $M$ alone and goes to 0 as $M \to \infty$.

Proof: Let $B'$ be the set of size $t$ for which $\tilde{\Delta}(B')$ is minimized. The proof centers on showing that the critical radius
of $B'$ is at most 4. All else follows directly from Lemmas 9 and 10. Assume to the contrary that the critical radius of $B'$ is
greater than 4. Thus, up to translation, $A'$ is a subset of $B'$. But this implies that $\tilde{\Delta}(A') < \tilde{\Delta}(B')$, a contradiction. ■

C. Bounding DC(q, L)

We are now in a position to prove Theorem 2. Recall that $A_i$ is the set of output letters in $\mathcal{Y}$ which get mapped to the letter
$z_i \in \mathcal{Z}$. Also, recall that $A'_i$ is simply $A_i$ with the last entry dropped from each vector.

Proof of Theorem 2: By combining (2), (10), (15), and (21), we have that as long as condition (20) holds for all $A'_i$,
$1 \le i \le L$, the degrading cost $\mathrm{DC}(q,L)$ is at least

$$\sum_{i=1}^{L} \frac{(q-1) \cdot (q-1)!}{2(q+1) \cdot (\sigma_{q-1})^{\frac{2}{q-1}}} \cdot \mathrm{Vol}(A'_i)^{\frac{q+1}{q-1}} + o(1) . \tag{22}$$

Recalling that the elements of $A_i$ are probability vectors, we deduce that condition (20) must indeed hold. Indeed,

$$\|p' - \bar{p}'\|_2^2 \le \|p' - \bar{p}'\|_1 \le \|p'\|_1 + \|\bar{p}'\|_1 \le 2 .$$

The first inequality follows from the fact that $p^2$ is at most $p$ for $0 \le p \le 1$, applied to each $|p_x - \bar{p}_x|$. The second inequality is the triangle inequality.
The third inequality holds since the entries of $p'$ and of $\bar{p}'$ are non-negative and sum to at most 1.

Next, recall that $\mathrm{Vol}(A'_i) = |A'_i| / M^{q-1} = |A_i| / M^{q-1}$, and thus

$$\sum_{i=1}^{L} \mathrm{Vol}(A'_i) = \frac{|\mathrm{out}(W)|}{M^{q-1}} = \frac{\binom{M+q-1}{q-1}}{M^{q-1}} . \tag{23}$$

Note that the RHS converges to $1/(q-1)!$ as $M \to \infty$. By convexity, we have that if we are constrained by (23), then the
sum in (22) is lower bounded by setting all $\mathrm{Vol}(A'_i)$ equal to the RHS of (23) divided by $L$. Thus, after taking $M \to \infty$, we
get (3). ■
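The last step of the proof is a short algebraic simplification that can be confirmed numerically. The sketch below assumes the constant in the per-set bound is $\frac{(q-1)(q-1)!}{2(q+1)\sigma_{q-1}^{2/(q-1)}}$ with exponent $\frac{q+1}{q-1}$ on the volume, and that the total volume in the limit is $1/(q-1)!$; the function names are hypothetical. It also shows, on one unequal split, that spreading the volume unevenly only increases the sum, as convexity predicts.

```python
import math

def sigma(n):
    """Volume of the unit ball in R^n."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def split_sum(q, volumes):
    """Sum over the sets of constant * Vol^((q+1)/(q-1)), in the limit M -> infinity."""
    n = q - 1
    constant = n * math.factorial(n) / (2 * (q + 1) * sigma(n) ** (2 / n))
    return constant * sum(v ** ((q + 1) / n) for v in volumes)

def theorem_bound(q, L):
    """The closed-form lower bound on DC(q, L), under the assumed form."""
    n = q - 1
    return (n / (2 * (q + 1))
            * (1 / (sigma(n) * math.factorial(n))) ** (2 / n)
            * (1 / L) ** (2 / n))

q, L = 5, 50
total = 1 / math.factorial(q - 1)            # limit of the total volume
equal = [total / L] * L
unequal = [total * 2 * k / (L * (L + 1)) for k in range(1, L + 1)]
print(split_sum(q, equal), theorem_bound(q, L), split_sum(q, unequal))
```

The equal split reproduces the closed-form bound, while the unequal split with the same total volume gives a strictly larger value.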

Acknowledgments: The author thanks Eren Şaşoğlu and Igal Sason for their feedback.